Promoting Open Science
in Learning Research

Collaborative Exploration of Future Directions



Jürgen Schneider
Caspar J. Van Lissa, Marjan Bakker and Olmo van den Akker

EARLI SIG 4/17

20 September 2024

Overview of the session


  1. Introduction to 3 open science aspects (10min each)
    • Preregistration & Registered Reports
    • Open and FAIR data
    • Open and Reproducible Code

  2. Subgroup Discussions (60min)
    • One group for each OS aspect
    • Applying a model for managing complex change
    • For each OS aspect: Where do you see the greatest need for change?
    • What specific actions can we take or initiate?

  3. Bringing Ideas Together, Plenary Discussion (30min)

What is open data?


Definition

  • anyone
  • can readily access the data
  • at no more than a reasonable reproduction cost (e.g., the cost of an internet connection)

(Open Knowledge Foundation, 2023)




But…


  • Openness is not a dogma and not a dichotomy

“As open as possible, as closed as necessary” (European Commission, 2023, p. 36)

What is FAIR data?



Purpose


“enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals” (Wilkinson et al., 2016, p. 1)

See also go-fair.org




FAIR vs. open


“does not necessarily mean that data has to be “open” […] even highly protected data can be FAIR data”
(Kraft, 2023)

What is FAIR data?



Principles and examples of challenges:

  • Findable: You provided your data online, but others don't know of it or can't find it.
  • Accessible: Others found your data, but can't download it, or don't know if or in what ways they may use it.
  • Interoperable: Others found and accessed your data, but can't open or work with it.
  • Reusable: Others found, accessed, and opened your data, but don't understand the data set.

Potential impact and benefits


For science: Reuse

  • FAIR data is a requirement for reuse, e.g.:
    • Use in meta-analyses
    • Answering new research questions
    • Teaching / student theses
    • As a historical artefact (mostly qualitative data)

Potential impact and benefits


For researchers

Getting cited

Getting funded

  • Funders 🇪🇺 ERC (2022) & 🇩🇪 DFG (2015) require open data
  • Federal agencies (e.g., IEA, 2022) and scientific societies (e.g., APA, 2017) endorse open sharing

Getting published

Getting hired

  • new metrics for evaluation on the rise (CoARA, 2022)
    Signatories: ERC, League of European Research Universities, European University Association …

Examples of good practices

The problem:
Just because we provide data online doesn’t mean that others will find it.

We could have the greatest data set to answer further research questions - if our colleagues don’t know it exists or can’t locate the data, openness will be of little value.

The solutions:

  • Get a persistent identifier (e.g., a DOI) for the location where you provide your data
  • Mention the DOI in publications that build on this data (e.g., in the “data accessibility statement”)
  • Describe your data as richly as possible (metadata). Research data centers offer form fields tailored to the discipline or data type. With general-purpose repositories, use alternatives such as keyword fields.
    • e.g., which variables does the quantitative data set contain?
    • e.g., which topics does your data cover?
    • e.g., which population did you draw your sample from?

The problem:
Just because others find our data doesn’t mean the access barriers are as low as possible, or that they know in which ways they are allowed to access it. Examples:

  • Providing a link to the data in the text of a paywalled journal article
  • Unclear licensing / use conditions when providing data (e.g., are non-researchers allowed to access the data or is it only open for qualified researchers?)

The solutions:

  • Make sure access is free of charge (or as cheap as possible)
    • e.g., by providing the link to the data in publicly accessible sections of journal articles that are not open access
    • e.g., by using repositories or research data centers that allow access free of charge
  • Make sure users know if they can access and under which conditions
    • e.g., research data centers ensure that terms of use are clear (who may access under what conditions) and offer different levels of access restriction
    • e.g., on repositories, provide a readme file and an open license (e.g., CC0, CC-BY, CC-BY-SA) with data sets to clarify access conditions

The problem:
Just because others downloaded our data doesn’t mean they can open and manipulate it.

The solutions:

  • Use file formats with open licenses
    • e.g., tabular data: CSV (with additional labelling script), RData
    • e.g., text data: PDF, HTML, ODT, RTF
  • Make sure users know how different files are related to one another
    • e.g., define which file contains student data and which teacher data
    • e.g., define which file contains data from cohort 1 and which cohort 2, …
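The open-format point above can be sketched in a few lines of base R: share tabular data as a plain CSV plus a small labelling script that restores value labels. File and variable names here are hypothetical examples, not from the slides.

```r
# Sketch: share tabular data as an open CSV plus a small labelling script.
# File and variable names are hypothetical examples.
dat <- data.frame(sex = c(1, 2, 1), achievement = c(12.5, 9.0, 14.2))
csv_path <- file.path(tempdir(), "cohort1_students.csv")
write.csv(dat, csv_path, row.names = FALSE)   # open, text-based format

# The labelling script, shipped alongside the CSV, restores value labels:
label_data <- function(d) {
  d$sex <- factor(d$sex, levels = c(1, 2), labels = c("male", "female"))
  d
}
labelled <- label_data(read.csv(csv_path))
```

Keeping the labels in a script (rather than in a proprietary binary format) means any user with any software can still read the raw values.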

The problem:
Just because others opened our data doesn’t mean they understand the data and its use conditions. Examples:

  • Others can’t understand what the column names of the tabular data set mean: Which columns in the data set relate to which variables in the journal article?
  • Can someone from sociology use the data set from psychology they found on osf.io?
  • Does someone reusing a data set have to cite the authors?

The solutions:

  • Adhere to standards in folder organization
    • e.g., PSYCH-DS (see technical specification draft)
  • Rich description/explanation of what users will find in the data set (as opposed to the metadata describing the data set as a whole)
    • e.g., provide a codebook; to create one semi-automatically, see the R package codebook
  • Provide a license for the use-cases
    • again, research data centers ensure that terms of use are clear (who may use under what conditions)
    • again, on repositories provide a readme-file and an open license (e.g., CC0, CC-BY, CC-BY-SA) with data sets for the use-cases
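To make the codebook idea concrete, here is a hand-rolled sketch of what a minimal codebook contains; the codebook R package mentioned above automates this (the data set and labels below are hypothetical):

```r
# Hand-rolled sketch of a minimal codebook; the `codebook` R package
# automates this. Variable names and labels are hypothetical.
dat <- data.frame(mot_1 = c(3, 4, 2), age = c(14, 15, 14))
cb <- data.frame(
  variable = names(dat),
  type     = vapply(dat, function(x) class(x)[1], character(1)),
  n_unique = vapply(dat, function(x) length(unique(x)), integer(1)),
  label    = c("Motivation item 1 (1-5 Likert)", "Age in years")  # written by hand
)
```

Sharing `cb` alongside the data answers the "which column is which variable?" question directly.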

Examples of good practices


tl;dr

What typical data sharing might look like:

To do: describe what's in the data
  • quantitative: create a codebook via the codebook R package
  • qualitative: provide a methods and/or field report

To do: share via a repository / research data center (rdc) that offers a DOI
  • 🇪🇺 Zenodo (repository; option: restricted visibility)
  • 🇺🇸 OSF (repository; option: private project)
  • 🇺🇸 Open ICPSR (rdc)
  • 🇬🇧 UK Data Archive (rdc)
  • 🇩🇪 VerbundFDB (rdc)

To do: connect the DOI with the paper
  • put it in a section such as 'data availability', 'open practices', or 'supplemental material'

Examples of good practices

Open and Reproducible Code

Why Reproducibility?

  • Every analysis has “inductive bias” (Sterkenburg & Grünwald, 2021)
    • What we learn from the data depends, in part, on how we analyze it
  • Implicit steps in the analysis make inductive bias intractable (Peikert, 2023)
    • We don’t know how, or how much, they influenced our results
  • Reproducibility makes inductive bias tractable
  • Reproducible code enables quantifying and studying inductive bias
  • Ideally, your whole analysis is a sealed “pipeline”: data in, results out
  • Multiverse analysis: conducting a study of the impact of all “reasonable” analysis decisions on the estimand/conclusion (Steegen et al., 2016)
  • Reproducibility -> Scalability
    • Apply same method in new study
    • Redo analysis when new data come in
    • Incorporate the analysis into applications for stakeholders, etc.
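A toy illustration of the multiverse idea: run the same question under several "reasonable" analysis decisions and compare the estimates. Data and model specifications below are simulated and hypothetical.

```r
# Toy multiverse sketch: estimate the effect of x on y under two
# "reasonable" model specifications (simulated data).
set.seed(1)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- 0.5 * d$x + 0.3 * d$z + rnorm(100)

specs <- list(y ~ x, y ~ x + z)   # the "reasonable" analysis decisions
estimates <- vapply(specs, function(f) coef(lm(f, data = d))[["x"]], numeric(1))
# comparing `estimates` shows how much the decision alone moves the result
```

A real multiverse analysis crosses many such decisions (exclusions, operationalizations, models), but the principle is the same: make every implicit analytic choice explicit and inspect its impact.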

Reproducibility is Challenging

Where do you start?

What tools do you need to learn?

What workflow is right for you?

Introducing WORCS

Workflow for Open Reproducible Code in Science

  • Standardized workflow
  • Low threshold, high ceiling
  • Conceptual platform-independent principles: https://psyarxiv.com/k4wde
  • “One-click” solution for R-users: https://cran.r-project.org/package=worcs
  • Defaults based on best practices (several experts contributed)
  • Compatible with journal/university requirements and other workflows
  • Pulling down the learning curve!

[Figure: learning curve]

The tools

1. Dynamic document generation

2. Version control

3. Dependency management

1. Dynamic document generation

  • Paper consists of text and code
  • Results, figures, and tables automatically generated
  • Formatted as APA paper (including citations!)

Important because:

  • Save time from copy-pasting output and formatting paper
  • Eliminate human error in copying results;
  • When revising the paper, all results are automatically updated;
  • Reproducible by default: Just generate the document

R Markdown example

[Screenshot: R Markdown source document]
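A minimal sketch of what such an R Markdown source might contain (assuming the papaja APA template; `scores` is a hypothetical object computed in an earlier chunk, and the citation key is illustrative):

```markdown
---
title  : "My Paper"
output : papaja::apa6_pdf
---

The mean score was `r mean(scores)` (*SD* = `r sd(scores)`);
inline chunks like these are recomputed on every render,
and citations such as [@wilkinson2016] are formatted automatically.
```

Because results live in code chunks rather than pasted text, re-rendering the document after a data revision updates every number at once.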

R Markdown example rendered

[Screenshot: rendered document with data analysis and citation]

2. Version control (using Git)

Why version control?

  • NO MORE manuscript_final_final_SERIOUSLYFINAL.doc

  • “Track Changes” on steroids: record entire project history

  • If something breaks, you can figure out what happened.

  • Facilitates collaboration and experimentation!

2. Version control (using Git)

Tracks changes to (text-based) files line by line:

  • add files to your repository
  • commit changes to these files
  • push all commits to remote repository (private backup or public online supplement)

One command in worcs: git_update("Describe your changes")

Image credit: Software Carpentries

Introducing GitHub

  • worcs repository is backed up in a remote repository like GitHub;

  • GitHub is a “cloud backup” with “social networking” features

    • Clone other people’s repository to reproduce or build upon them
    • Open Issues with questions or comments about the work
    • Send suggested changes as a “Pull request”
  • GitHub can be used to ‘tag’ specific states of the repository, e.g., a preregistration.

Important because:

  • Complete backup of entire project history
    • Go back to previous version if you want
    • Try new things, don’t worry about losing work
    • Prove that you preregistered your plans and followed them
  • Easy collaboration online (even with strangers)
    • People can copy your project and build on it
  • GitHub can be your preregistration, your research archive, supplementary materials, comments section, etc.
  • Connects to OSF.io project page
    • Improves Findability
    • Get DOI for project and/or specific resources
  • Connects to Zenodo
    • Get DOI for project and/or specific resources
    • Store project snapshot

3. Dependency management

  • To make project reproducible, people must have access to your (exact) software dependencies
    • For R-users, these are R-packages
  • Difficult trade-off:

Dependency management in WORCS

  • Maintains text-based list of packages, their version,
    and origin (e.g., “CRAN”, “Bioconductor”, “GitHub”)
  • This list can be version-controlled with Git;
  • When a user loads the project,
    renv installs all dependencies from the list
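Such a text-based dependency list might look like the following excerpt (worcs relies on renv's renv.lock format; the package, version, and repository values here are illustrative, not from a real project):

```json
{
  "R": { "Version": "4.3.2" },
  "Packages": {
    "ggplot2": {
      "Package": "ggplot2",
      "Version": "3.4.4",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```

Because this is plain text, Git tracks every change to the dependency set alongside the code that uses it.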

Important because:

  • Essential for reproducibility
  • Good for collaboration (everybody has same versions)
  • Nice to your “future self”: Your code will work in the future

Unique features in worcs

  • RStudio template
  • Automatic installation check: check_worcs_installation()
  • Easy GitHub integration
    • Add URL during project creation
    • git_update("Commit message")
    • Automatically reproduce results in the cloud!
  • Manuscript and preregistration templates
    • From rticles, papaja, and prereg
    • Original templates for secondary- and longitudinal data
  • Data sharing solutions
  • Cite @essential and @@nonessential
  • Integration with targets
  • WORCS checklist and badge

Sharing data in WORCS

  • Reproducibility requires open data
  • Some data may be (privacy) sensitive
    • E.g., children’s data, veterans’ data, patient data

Use open_data():

  • Original data made public
  • Default is a .csv (text based, human / machine readable)
  • Other save / load functions can be used

Use closed_data():

  • Original data saved locally;
  • Synthetic data created using synthetic()
  • Synthetic data made public (default: .csv)
  • Unique ID of original data made public (so people can audit your work)

Sharing data in WORCS

Loading data load_data():

  • If original data are present, load them…
  • Else, load synthetic data
  • Scripts can thus ALWAYS be reproduced
  • People can create a working script using synthetic data, and send it to you to run on original data
  • Load function recorded in .worcs file; default read.csv()
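The fallback logic behind this can be sketched in base R (a simplified illustration of the concept, not the package source; file names are hypothetical):

```r
# Simplified sketch of load_data()'s fallback: prefer the original data,
# otherwise fall back to the publicly shared synthetic data.
original  <- file.path(tempdir(), "data.csv")            # kept locally
synthetic <- file.path(tempdir(), "synthetic_data.csv")  # always shared
write.csv(data.frame(x = 1:3), synthetic, row.names = FALSE)

dat <- if (file.exists(original)) read.csv(original) else read.csv(synthetic)
# the script runs either way, so the analysis is always reproducible
```

Collaborators without access to the sensitive original thus run the identical script end to end, and only the numbers differ.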

Reproducing WORCS Project

  1. Create entry point (e.g., manuscript.Rmd)
  2. Define recipe (e.g., rmarkdown::render("manuscript.Rmd"))
  3. Snapshot the endpoints (e.g., manuscript.pdf, table1.csv)

worcs::reproduce() generates the endpoints from the entry point via the recipe

worcs::check_endpoints() verifies that the results are identical
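The idea behind endpoint checking can be illustrated with file hashes (a sketch of the concept, not worcs internals; file names are hypothetical):

```r
# Concept sketch: snapshot a hash of an endpoint, then verify that a
# regenerated file still matches it.
endpoint <- file.path(tempdir(), "table1.csv")
write.csv(data.frame(m = 1.23), endpoint, row.names = FALSE)
snapshot <- unname(tools::md5sum(endpoint))   # stored at snapshot time

# ... re-run the analysis, regenerating the endpoint ...
write.csv(data.frame(m = 1.23), endpoint, row.names = FALSE)
reproduced <- unname(tools::md5sum(endpoint)) == snapshot
```

If any byte of a regenerated endpoint differs from the snapshot, the check fails, flagging a reproducibility problem before reviewers do.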

Continuous Integration

Run worcs::reproduce() on GitHub via GitHub Actions

targets Integration

targets creates a pipeline for computationally intensive workflows

  • Each step is only re-run if:
    • The step changed
    • Its inputs changed

This is perfectly compatible with worcs!

Using targets

  • Select “Use targets” when creating a WORCS project
  • Use worcs::add_targets()
  • Select “Target Markdown” as output format

A targets workflow is executed by running targets::tar_make()

  • worcs sets the recipe to targets::tar_make(), so worcs::reproduce() runs it
  • worcs makes sure that the last step of the pipeline is to render an R Markdown document reporting the results

For non-R-users

  • The WORCS paper addresses the conceptual workflow
  • Covers issues/decisions you have to consider for Open Science, regardless of software
  • worcs is a good starting point for new R-users
    • Setup Tutorial to help install everything
    • Tricky issues (like project management and using Git) are ~automatic when using the WORCS template
    • Automatic check in case you get stuck: check_worcs_installation()
  • Learn good habits from the start; don’t reinvent the wheel

Find out more:

cjvanlissa.github.io/worcshop
cjvanlissa.github.io/worcs

Introduction of Subgroup Discussions

  • Short introduction to Model for Managing Complex Change (Lippitt-Knoster)
  • Explanation of how we document the outcomes of subgroup discussions
    • Link to document prereg: https://t1p.de/sig-prereg
    • Link to document open data: https://t1p.de/sig-data
    • Link to document open code: https://t1p.de/sig-code
  • If you want to connect: put your name and email at the bottom of the Google Docs document

Thank you



Jürgen Schneider

References

Artner, R., Verliefde, T., Steegen, S., Gomes, S., Traets, F., Tuerlinckx, F., & Vanpaemel, W. (2021). The reproducibility of statistical results in psychological research: An investigation using unpublished raw data. Psychological Methods, 26(5), 527–546. https://doi.org/10.1037/met0000365
CoARA. (2022). Agreement on Reforming Research Assessment.
Colavizza, G., Hrynaszkiewicz, I., Staden, I., Whitaker, K., & McGillivray, B. (2020). The citation advantage of linking publications to research data. PLOS ONE, 15(4), e0230416. https://doi.org/10.1371/journal.pone.0230416
DFG. (2015). DFG Guidelines on the Handling of Research Data.
ERC. (2022). Open Research Data and Data Management Plans. Information for ERC grantees.
Errington, T. M., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Challenges for assessing replicability in preclinical cancer biology. eLife, 10, e67995. https://doi.org/10.7554/eLife.67995
European Commission. (2023). Horizon Europe (HORIZON). HE Programme Guide. Version 4.0. Publications Office.
Kraft, A. (2023). The FAIR Data Principles. https://doi.org/10.23668/PSYCHARCHIVES.13577
Open Knowledge Foundation. (2023). What is Open Data? In Open Data Handbook. https://opendatahandbook.org/guide/en/what-is-open-data/.
Peikert, A. (2023). Towards Transparency and Open Science [Doctoral thesis, Humboldt-Universität zu Berlin]. https://doi.org/10.18452/27056
Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. https://doi.org/10.7717/peerj.175
Schneider, J., Rosman, T., Kelava, A., & Merk, S. (2022). Do Open-Science Badges Increase Trust in Scientists Among Undergraduates, Scientists, and the Public? Psychological Science, 33(9), 1588–1604. https://doi.org/10.1177/09567976221097499
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing Transparency Through a Multiverse Analysis. Perspectives on Psychological Science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637
Sterkenburg, T. F., & Grünwald, P. D. (2021). The no-free-lunch theorems of supervised learning. Synthese, 199(3-4), 9979–10015. https://doi.org/10.1007/s11229-021-03233-1
Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results. PLoS ONE, 6(11), e26828. https://doi.org/10.1371/journal.pone.0026828
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18

Credit

Title page: OpenClipart-Vectors on www.pixabay.com

Icons by Font Awesome CC BY 4.0